Topic modeling of the fMRI literature¶

In this analysis we examine results obtained by applying BERTopic to the text of PubMed abstracts matching the following query:

'("fMRI" OR "functional MRI" OR "functional magnetic resonance imaging")
AND (brain OR neural OR neuroscience OR neurological OR psychiatric OR psychology)
AND %d[DP]' % year

We focus here on the years 2002 through 2022; we excluded earlier years because the number of matching PubMed abstracts for those years was quite small.
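For illustration, the per-year query strings are generated by substituting each year into the `%d[DP]` (date of publication) placeholder. A minimal sketch of that substitution (the actual retrieval code lives in the fmritopics package):

```python
# Build one PubMed query string per year by filling the %d[DP]
# (date of publication) placeholder. Illustrative sketch only.
query_template = (
    '("fMRI" OR "functional MRI" OR "functional magnetic resonance imaging") '
    'AND (brain OR neural OR neuroscience OR neurological OR psychiatric OR '
    'psychology) AND %d[DP]'
)

# One query per year in the range of interest.
queries = {year: query_template % year for year in range(2002, 2023)}

print(queries[2002])
```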

First we load the relevant functions:

In [31]:
from fmritopics.fit_dynamic_topic_model import load_data, get_embeddings
from fmritopics.analyze_dynamic_topics import (
    load_model,
    get_topics_over_time,
    get_hierarchical_topics,
    plot_hierarchical_topics,
    get_top_topics_over_time,
    plot_top_topics,
    get_slopes
)
from collections import defaultdict
import pandas as pd
import umap
import umap.plot
import numpy as np
import matplotlib.pyplot as plt

Now we need to select the topic modeling parameters. Two main parameters determine the number of topics in BERTopic: the number of neighbors used in the UMAP dimensionality reduction step, and the minimum cluster size for the HDBSCAN clustering step. We ran models with a range of values for both parameters; while the overall conclusions are robust across these settings, the specific topics and their changes over time did vary for some values.

In [2]:
min_cluster_size = 250
n_neighbors = 25
datadir = 'data'
modeldir = 'models'
In [3]:
sentences, years = load_data(datadir)

minclust, nneighbors = min_cluster_size, n_neighbors

topic_model, embedding_model, model_name = load_model(minclust, nneighbors, modeldir)

topics_over_time = get_topics_over_time(sentences, years, topic_model)

hierarchical_topics, tree = get_hierarchical_topics(topic_model, sentences, viz=True)
File data/bigrammed_cleaned_abstracts_1991.pkl does not exist
Loaded model from models/model-bertopic_minclust-250_nneighbors-25_gpt4/model-bertopic_minclust-250_nneighbors-25_gpt4
getting topics over time
32it [00:09,  3.36it/s]
100%|██████████████████████████████████████████| 52/52 [00:00<00:00, 210.05it/s]

We can visualize the hierarchical relationships between the different topics. To label each of the topics, we use GPT-4 to generate labels based on the text in the documents associated with each topic.

Note that the long labels get chopped off and there is no easy way to fix this, but the meaning of most topics is still pretty clear.

In [4]:
fig = topic_model.visualize_hierarchy()
fig

We can also look at an embedding of the multidimensional space of relationships between topics using dimensionality reduction (via the UMAP technique). The following cell will create a visualization of documents and topics; it is easiest to see the individual topics by sliding the slider to the right in order to show fewer topics. We will look at this kind of plot in more detail later in the notebook.

In [5]:
embeddings, embedding_model = get_embeddings(sentences)
fig, reduced_embeddings = plot_hierarchical_topics(
    topic_model, embeddings, sentences, hierarchical_topics,
    min_cluster_size, n_neighbors)

fig
using existing embeddings from data/embeddings.pkl

Next we want to look at how the topics change over time, focusing on the most common topics. Note that we have removed the global (or "outlier") topic, which comprises a broad set of documents (about 39% of the total) that matched fairly generic terms and were not similar enough to any topic to be assigned to it. It is, however, interesting to see how the words associated with this generic topic change over time, from an earlier focus on task activation studies to a later focus on connectivity and network models.
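The outlier fraction quoted above can be computed directly from the topic assignments, where BERTopic uses `-1` to mark outlier documents. A minimal sketch with a toy assignment list (in the real analysis this would be `topic_model.topics_`):

```python
import numpy as np

# Toy topic assignments: -1 marks BERTopic's outlier topic.
toy_topics = np.array([-1, 0, 1, -1, 2, 0, -1, 1, 2, 0])

# Fraction of documents assigned to the outlier topic.
outlier_fraction = np.mean(toy_topics == -1)
print(f"{outlier_fraction:.0%} of documents fall in the outlier topic")
```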

In [6]:
topics_over_time.query('Topic == -1')
Out[6]:
Topic Words Frequency Timestamp Sum Probability
291 -1 motor, cortex, task, movement, visual 365 2002-01-01 901 0.405105
342 -1 cortex, task, motor, visual, movement 447 2003-01-01 1157 0.386344
395 -1 cortex, task, visual, stimulus, neural 517 2004-01-01 1415 0.365371
449 -1 cortex, task, visual, motor, stimulus 722 2005-01-01 1843 0.391753
503 -1 cortex, task, visual, stimulus, motor 849 2006-01-01 2143 0.396174
557 -1 cortex, task, visual, motor, stimulus 965 2007-01-01 2445 0.394683
611 -1 cortex, task, stimulus, visual, neural 1018 2008-01-01 2539 0.400945
665 -1 cortex, task, neural, visual, stimulus 1122 2009-01-01 2967 0.378160
719 -1 cortex, task, neural, visual, stimulus 1280 2010-01-01 3413 0.375037
773 -1 cortex, task, visual, neural, motor 1524 2011-01-01 3870 0.393798
827 -1 cortex, task, neural, network, visual 1732 2012-01-01 4425 0.391412
881 -1 cortex, task, neural, network, visual 1965 2013-01-01 4977 0.394816
935 -1 cortex, task, neural, network, connectivity 2046 2014-01-01 5230 0.391205
989 -1 cortex, network, task, neural, visual 2127 2015-01-01 5414 0.392870
1043 -1 network, cortex, neural, connectivity, task 2111 2016-01-01 5236 0.403170
1097 -1 network, connectivity, cortex, neural, task 2217 2017-01-01 5400 0.410556
1151 -1 network, connectivity, cortex, neural, task 2245 2018-01-01 5457 0.411398
1205 -1 network, connectivity, cortex, neural, task 2200 2019-01-01 5521 0.398479
1259 -1 network, connectivity, neural, task, cortex 2320 2020-01-01 5763 0.402568
1313 -1 network, connectivity, neural, cortex, cognitive 2446 2021-01-01 6146 0.397982
1367 -1 network, connectivity, neural, cognitive, cortex 2254 2022-01-01 5732 0.393231

For the rest of the analyses we will exclude this general topic.

We next wanted to generate a visualization of how the top topics changed over time. To do this, we identified all topics that were in the top three most prevalent topics in any year, which resulted in a set of nine topics for this particular set of parameters. Here is a list of names generated by GPT-4 based on the text in the documents associated with each of the top topics:
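Conceptually, this selection takes the union of topics appearing in any year's top three by frequency. A minimal sketch of that step with pandas (toy data; the actual `get_top_topics_over_time` implementation may differ):

```python
import pandas as pd

# Toy topics-over-time table with hypothetical frequencies.
df = pd.DataFrame({
    "Topic":     [0, 1, 2, 3, 0, 1, 2, 3],
    "Timestamp": [2002] * 4 + [2003] * 4,
    "Frequency": [50, 40, 30, 20, 10, 60, 55, 50],
})

# For each year, take the three most frequent topics, then form the union.
top3_per_year = (
    df.sort_values("Frequency", ascending=False)
      .groupby("Timestamp")
      .head(3)
)
top_topics = sorted(top3_per_year["Topic"].unique())
print(top_topics)  # topics 0 and 3 each make the top three in only one year
```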

In [9]:
top_topics_over_time = get_top_topics_over_time(topics_over_time, topic_model)

# select a single timestamp (Sum == 901 is unique to 2002) to list each topic name once
top_topics_over_time.query("Sum == 901").Name.tolist()
Out[9]:
['Schizophrenia: Genetic Connectivity and Cognitive Dysfunction in Patients',
 'Neuronal Plasticity and Seizures in Epilepsy Patients',
 "Neural and Linguistic Mechanisms in Children's Reading Development",
 'Neuroimaging and Mechanisms of Chronic Pain',
 'Dynamic Connectivity and Network Structure in Neuroimaging',
 'Neuroimaging and Connectivity in Major Depressive Disorder (MDD)',
 'Medial Temporal Lobe Episodic Memory Retrieval and Encoding',
 'The Impact of Food Intake and Reward on Obesity and Eating Habits',
 'Cognitive Performance, Interference, and Neural Mechanisms in Working Memory Tasks']

Now we can plot how these change over time.

In [10]:
# these are manually generated vertical offsets for the labels,
# which will differ between models
offsets = defaultdict(lambda: None)
offsets[(250, 25)] = defaultdict(lambda: 0)
offsets[(250, 25)][3] = -.003
offsets[(250, 25)][5] = .002
offsets[(250, 25)][1] = -.001
offsets[(250, 25)][2] = .001
offsets[(250, 25)][7] = -.001

plot_top_topics(topics_over_time, topic_model, 
                    minclust, nneighbors, offset=offsets[(min_cluster_size, n_neighbors)])

We can see that there are some topics that rise substantially over the course of the last two decades, and others that fall. We can quantify this by computing the slope of each trend line and sorting by those values:
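`get_slopes` presumably fits a linear trend to each topic's frequency time series; a minimal sketch of that computation using `np.polyfit` on toy data (hypothetical yearly proportions for a single topic):

```python
import numpy as np

# Toy yearly proportions for one topic (hypothetical, steadily rising).
years_x = np.arange(2002, 2023)
proportions = np.linspace(0.02, 0.05, len(years_x))

# Slope of the least-squares trend line, in proportion units per year.
slope, intercept = np.polyfit(years_x, proportions, deg=1)
print(f"slope = {slope:.6f} per year")
```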

In [11]:
slope_df = get_slopes(top_topics_over_time, topic_model)


pd.set_option('display.max_colwidth', None)
slope_df
Out[11]:
topic slope topicname
2 2 -0.001690 Neural and Linguistic Mechanisms in Children's Reading Development
8 11 -0.001617 Cognitive Performance, Interference, and Neural Mechanisms in Working Memory Tasks
6 6 -0.001001 Medial Temporal Lobe Episodic Memory Retrieval and Encoding
1 1 -0.000599 Neuronal Plasticity and Seizures in Epilepsy Patients
3 3 0.000084 Neuroimaging and Mechanisms of Chronic Pain
0 0 0.000707 Schizophrenia: Genetic Connectivity and Cognitive Dysfunction in Patients
7 7 0.001068 The Impact of Food Intake and Reward on Obesity and Eating Habits
4 4 0.001199 Dynamic Connectivity and Network Structure in Neuroimaging
5 5 0.001480 Neuroimaging and Connectivity in Major Depressive Disorder (MDD)

This shows that three of the four topics with negative slopes were related to basic cognitive neuroscience (reading, episodic memory, and working memory), while four of the five topics with positive slopes were related to clinical disorders. Interestingly, the only methodological topic to emerge for this particular parameter set was dynamic connectivity modeling, which showed an increasing trend over time.

Now let's look at the relationships between these topics. We can take the original embeddings of the text (which have 384 dimensions) and embed them in a two-dimensional space for visualization, using the UMAP technique (which is also used under the hood by BERTopic).

In [12]:
mapper = umap.UMAP(n_neighbors=25, 
                   n_components=2, 
                   min_dist=0.1, metric='cosine').fit(embeddings)

First, let's plot the entire space with all topics:

In [19]:
topic_array = np.array(topic_model.topics_)
umap.plot.points(mapper, theme='fire', labels=topic_array, color_key_cmap='Paired',
                 background='black', show_legend=False)
Out[19]:
<Axes: >

This looks pretty cool but isn't very useful for actually understanding what's going on.

It's more useful to plot only the top topics in this two-dimensional space; each point in this figure is one document, labeled by the topic to which it was assigned.

In [32]:
topicnames = np.array([topic_model.get_topic(i)[0][0] for i in topic_array])
years_array = np.array(years)

# select documents from the top topics, published after 2001
# (note: use .values here — `in` on a pandas Series checks the index, not the values)
subset_points = np.isin(topic_array, slope_df.topic.values) & (years_array > 2001)
plt.figure(figsize=(12, 10))
ax = umap.plot.points(mapper, labels=topicnames, color_key_cmap='Paired',
                      background='black', subset_points=subset_points,
                      ax=plt.gca())
legend = ax.get_legend()
legend.set_bbox_to_anchor((1, 1), transform=ax.transAxes)

Another thing we can look at is how the space changes over the years. To do this, let's plot the same data but color the data points by the year of publication:

In [35]:
plt.figure(figsize=(8,10))

ax = umap.plot.points(mapper, labels=years_array, cmap='viridis', ax=plt.gca(),
                 background='black', subset_points=subset_points)
legend = ax.get_legend()
legend.set_bbox_to_anchor((1, 1), transform=ax.transAxes)

This shows that the more recent papers seem to lie somewhat more centrally within the clusters, suggesting that papers within a topic become more similar to one another over time.
